Abstract

List ranking and list scan are two primitive operations used in many parallel algorithms that use list, tree,
and graph data structures. But vectorizing and parallelizing list ranking is a challenge because it is highly
communication intensive and dynamic. In addition, the serial algorithm is very simple and has very small
constants. In order to compete, a parallel algorithm must also be simple and have small constants. A parallel
algorithm due to Wyllie is such an algorithm, but it is not work efficient; its performance degrades for
longer and longer linked lists. In contrast, work efficient PRAM algorithms developed to date have very
large constants. We introduce a new fully vectorized and parallelized algorithm that is both work efficient
and has small constants. However, it does not achieve O(log n) running time. But we contend that work
efficiency and small constants are more important, given that vector and multiprocessor machines are used
for problems that are much larger than the number of processors and, therefore, the O(log n) time is never
achieved in practice. In particular, to the best of our knowledge our implementation of list ranking and list
scan on the Cray C-90 is the fastest implementation to date. In addition, it is the first implementation of
which we are aware that outperforms fast workstations. The success of our algorithm is due to its relatively
large grain size and simplicity of the inner loops, and the success of the implementation is due to pipelining
reads and writes through vectorization to hide latency, minimizing load balancing by deriving equations for
predicting and optimizing performance, and avoiding conditional tests except when load balancing.


1  Introduction 


As production parallel and vector machines become faster and more commonplace, solving larger and larger
problems becomes feasible. However, large problems that have irregular sparsity structure or are dynamic
are often most efficiently represented and manipulated using lists, trees, and graphs. Use of such data
structures has become natural and common on sequential machines, but has been shunned in parallel
implementations. Theory indicates that use of irregular data structures can significantly reduce the problem
size and, therefore, can improve asymptotic performance. Many Parallel Random Access Machine (PRAM)
algorithms for such data structures have been developed. But are these PRAM algorithms practical? Can
we perform even the most primitive operations used by PRAM algorithms efficiently? We contend that
there is hope. For example, scan (prefix sum) is such a primitive operation and is applied to arrays. For
each element in the array it computes the "sum" of all the preceding elements in the array, where "sum" is
a binary associative operator. Elsewhere, the efficient vector parallel implementation of the scan primitive
has been shown to lead to greatly improved performance of several applications that cannot be vectorized
with existing compilers [6, 25]. However, this version of scan can only be applied to arrays that are linearly
ordered in consecutive locations. If the data are stored unordered and the ordering is provided by links or
pointers then we need to use other approaches for scan.
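As a point of reference, the array form of scan described above can be written as a short serial loop. The following is a minimal sketch in C, assuming integer addition as the "sum" operator and the exclusive convention (each result excludes its own element). The function name exclusive_scan_add is ours, and the vectorized implementations cited above are organized quite differently.

```c
#include <stddef.h>

/* Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0. */
void exclusive_scan_add(const int *in, int *out, size_t n)
{
    int sum = 0;                     /* identity of the "sum" operator */
    for (size_t i = 0; i < n; i++) {
        out[i] = sum;                /* "sum" of all preceding elements */
        sum += in[i];
    }
}
```

For the input 3, 1, 4, 1, 5 this produces 0, 3, 4, 8, 9.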

In this paper we consider a vector parallel implementation of list ranking and the scan operation applied
to linked lists. List ranking and list scan are two fundamental primitives that are commonly used in
solving problems on linked lists, trees, and graph data structures. Parallel algorithms frequently use list
ranking for ordering the elements of a list, finding the Euler tour of a tree, load balancing [11], contention
avoidance [15, 1], and parallel tree contraction [17], and these problems are subproblems of applications
such as expression evaluation, graph 3-connectivity, and planar graph embedding [18]. In addition, list
ranking is very interesting because it involves the kinds of problems for which it is hard to get good vector
or parallel performance. In particular, it uses an irregular data structure, is highly communication bound,
and its communication patterns are dynamic. From an algorithmic point of view it is interesting because it
has features common to many problems: contention avoidance and load balancing.

List ranking finds the position of each node in the list by counting the number of links between each
node and the head of the list. This position information can be used to reorder the nodes of the list into
an array in one parallel step. Then, for example, scan can be applied to the array. Alternatively, scan can
be applied directly to the linked list. We call this operation list scan, and for each node in the linked list it
computes the "sum" of the values of all the prior nodes in the list. List ranking and list scan are related in
that list ranking is the list scan where plus is the operator and the values to be summed are all equal to one.
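The reordering step mentioned above can be sketched as follows. This is an illustrative serial version in C, assuming the list is stored as an array of successor indices in which the tail links to itself; the function names are ours. The ranking here is done by a plain serial walk, so only the scatter in reorder_by_rank corresponds to the "one parallel step".

```c
/* Serial list ranking: rank[i] = number of links from the head to node i.
 * link[i] is the index of the successor of node i; the tail links to itself. */
void list_rank_serial(const int *link, int head, int *rank)
{
    int node = head, r = 0;
    for (;;) {
        rank[node] = r++;
        if (link[node] == node)      /* self loop: this is the tail */
            break;
        node = link[node];
    }
}

/* One data-parallel step: every node writes its value to its ranked slot.
 * The iterations are independent because the ranks are all distinct. */
void reorder_by_rank(const int *value, const int *rank, int *ordered, int n)
{
    for (int i = 0; i < n; i++)
        ordered[rank[i]] = value[i];
}
```

After the scatter, the array scan can be applied to ordered directly.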

In the comprehensive review of PRAM list ranking algorithms by Halverson and Das [13] there is only
one reference to an implementation of list ranking, which was Wyllie's algorithm on the CM-2. The only
other parallel implementations of list ranking of which the author is aware use a random pointer jumping
technique. Wyllie's algorithm [23] is work inefficient since it takes O(n log n) operations on an n element
list, whereas a serial implementation takes O(n) operations. But, because it is very simple it works well for
short lists or when we can increase the number of processors according to the linked list size. On the other
hand, the random pointer jumping technique [17, 3] suffers from having to take multiple trials on average
before being able to perform a pointer jump and, therefore, results in larger constants. Other work efficient
parallel PRAM list ranking algorithms have very large constants, which has inhibited their implementation.
Table 1 gives a comparison of list ranking algorithms, and Figure 1 compares the running times of five list
ranking algorithms on one processor of the Cray C-90. The Miller/Reif and Anderson/Miller algorithms
use random pointer jumping, and the Blelloch/Reid-Miller algorithm is the one on which we report here.

Table 1: Comparison of several list ranking algorithms, where n is the length of the list, p is the number of processors,
and m is a parameter of our algorithm (m < n / log n, and for the Cray C-90 m ≈ O([illegible])).

Figure 1: Execution times per element for several list ranking algorithms on one processor of the Cray C-90. The
times for Wyllie's algorithm and our algorithm were obtained on a dedicated machine. The sawtooth shape of the
Wyllie curve is due to the algorithm performing ⌈log(n - 1)⌉ rounds of pointer jumping over all the data.

We introduce a new fully vectorized and parallelized algorithm that is both work efficient and has small
constants. However, it does not achieve O(log n) running time. But we contend that work efficiency and
small constants are more important, given that vector and multiprocessor machines are used for problems
that are much larger than the number of processors and, therefore, the O(log n) time is never achieved in
practice. For lists shorter than 7000 elements Wyllie's algorithm is faster than ours. But for long lists our
implementation of list ranking and list scan on the Cray C-90 is the fastest implementation to date, to the
best of our knowledge. In addition, it is the first implementation of which we are aware that outperforms
fast workstations. For example, it achieves over two orders of magnitude speedup over a DECstation 5000
workstation. On a single processor it also achieves a factor of four speedup over a serial list scan on the
Cray C-90, which is significant since Cray computers are also very fast scalar machines (see the fallacy in
Section 7.8 of [14]). In particular, when vectorizing a serial problem that requires gather/scatter operations,
the best speedup one can expect on a single processor Cray C-90 is about a factor of 12-18; if the vectorized
algorithm does twice as much work as the serial code (both a reduction and a contraction phase, as ours does)
then the best one can expect is a 6-9 fold speedup on one processor. We obtain an additional speedup of 6.7 on
8 processors. In addition, our algorithm uses much less space than other algorithms, including Wyllie's.

1.1  Vector  Multiprocessors  as  PRAMs 

We chose to implement list ranking on a vector multiprocessor because these machines, such as the Cray
family of computers, closely approximate the abstract EREW PRAM machine; see Figure 2. These machines
use a shared memory model, have fine-grain access to memory, have extremely high global communication
bandwidth, and can hide functional and memory latencies through vectorization. The most important
feature that distinguishes these machines from MPP machines is the pipelined memory access. Processors
communicate with memory via a multistage butterfly-like interconnection network. As long as there are no
memory bank conflicts, the network can service one memory request per clock cycle. Thus, the PRAM
model assumption that often is cited as unrealistic, namely that memory access takes unit time, holds on
vector multiprocessors as long as we can avoid memory bank conflicts and hide latencies.

Figure 2: Vector multiprocessors viewed as a PRAM.

Zagha proposes several vector multiprocessing programming techniques for avoiding bank conflicts
and hiding latencies [24]. To address bank conflicts he proposes a data distribution technique to manage
the memory system explicitly. To address memory and functional unit latencies, he proposes virtual
processing, which is based on Valiant's Bulk Synchronous Parallel (BSP) model [21]. This unifying
model requires that algorithms be designed with sufficient parallel slackness, so that programs are written
for rather more virtual processors than physical processors. For vector multiprocessors, this slack allows
for vectorization so that computations and communication can be pipelined to hide latencies.

In Zagha's programming model we implement PRAM algorithms by treating a vector processor as a
SIMD (distributed memory) multiprocessor, where each element in the vector acts as a processor of the
SIMD machine; see Figure 2. Because processors in data parallel algorithms do not use the results of another
processor in the same time step, there are no recurrences to worry about in the corresponding vectorized
implementation. Extending the vectorized algorithm to vector multiprocessors is straightforward if the
machine is SIMD; simply treat the vector multiprocessor as an l × p SIMD multiprocessor, where l is the
length of the vector registers and p is the number of processors, and apply the vectorized algorithm. If the
machine is MIMD, it can be treated the same way except that, for efficiency, the number of synchronization
points should be minimized.
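The l × p view can be illustrated with a small serial sketch. The code below is only an illustration, not the paper's implementation: it assumes a trivial data-parallel step (y[v] += x[v]) over nv virtual processors, divides them into p contiguous blocks (one per physical processor), and processes each block in strips of at most l elements, where each strip stands for one vector operation. The function name and the block partitioning are our assumptions.

```c
/* Apply one data-parallel step, y[v] += x[v], to nv virtual processors that
 * are divided among p physical processors and processed in strips of at most
 * l elements (l = vector register length). Returns the number of strips. */
int strip_mined_step(int *y, const int *x, int nv, int p, int l)
{
    int block = (nv + p - 1) / p;          /* virtual processors per physical one */
    int strips = 0;

    for (int q = 0; q < p; q++) {          /* would run concurrently on p processors */
        int lo = q * block;
        int hi = lo + block < nv ? lo + block : nv;
        for (int s = lo; s < hi; s += l) { /* one vector operation per strip */
            int len = hi - s < l ? hi - s : l;
            for (int e = 0; e < len; e++)  /* no element depends on another */
                y[s + e] += x[s + e];
            strips++;
        }
    }
    return strips;
}
```

On a MIMD machine the outer loop over q is where synchronization would be needed, which is why the text recommends keeping such points few.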


The paper is organized as follows. In Section 2 we discuss the five list ranking algorithms we implemented.
Section 3 describes our implementation on the Cray C-90 and gives timing equations for each
part of the implementation. In Section 4 we analyze the expected performance, describe how we tuned the
parameters, and give our overall performance results. In Section 5 we describe the multiprocessor version
of the algorithm and its performance and review other PRAM list ranking algorithms. Finally, in Section 6
we discuss our conclusions and future directions.


2  The  List  Ranking  and  List  Scan  Algorithms 

List ranking computes the distance each node is from the head of the linked list. List scan computes the
"sum" of the values on the links in a linked list from the head of the linked list to each node in the linked
list, where "sum" is any binary associative operator. Since list ranking is a list scan with all weights equal
to one, we discuss list scan only. For simplicity we use integer addition as the "sum" operator. We represent
the linked list as a pair of arrays. The value array contains the value of each node of the list and the link
array contains the index of the next node in the list. The tail of the list is a self-loop, i.e., the link at the tail
is the index of the tail node.
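To make the "any binary associative operator" point concrete, here is a minimal serial sketch in C of a list scan over the pair-of-arrays representation with the operator passed in as a function pointer. The names binop, op_max and list_scan_op are ours, and the exclusive convention (each node receives the "sum" of the nodes before it) is assumed.

```c
#include <limits.h>   /* INT_MIN, the identity for max in the usage note */

typedef int (*binop)(int, int);

int op_max(int a, int b) { return a > b ? a : b; }

/* Exclusive list scan with a caller-supplied associative operator.
 * link[i] is the successor of node i (the tail links to itself),
 * value[i] is the value of node i, identity is the operator's identity. */
void list_scan_op(const int *link, const int *value, int head,
                  int identity, binop op, int *scan)
{
    int node = head, acc = identity;
    for (;;) {
        scan[node] = acc;            /* "sum" of all nodes before this one */
        if (link[node] == node)      /* self loop: tail reached */
            break;
        acc = op(acc, value[node]);
        node = link[node];
    }
}
```

Calling list_scan_op(link, value, head, INT_MIN, op_max, scan) gives a running maximum along the list; passing 0 and an addition function gives the integer list scan used in the rest of the paper.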

2.1  The serial algorithm

The serial list scan simply walks down the list saving the accumulated values of the previous nodes until it
reaches the end of the list. On the Cray C-90 it takes 44 clock cycles or 183 nsec to traverse each element
of the list, see Figure 1, and can be coded as follows. Let the array L_next represent the linked list where
each element contains the index of the next node in the list. The tail of the list is indicated by a self-loop,
i.e., if tail is the index of the last element in the list then L_next[tail] = tail.

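void Serial_List_Scan(int *L_sum, const int *L_next, const int *L_value, int head)
{
    /* L_sum   - list scan results
     * L_next  - linked list terminated with a self loop
     * L_value - values of the nodes
     * head    - head of linked list
     */
    int sum = 0;                      /* ZERO, the identity of integer addition */
    int next = head;

    for (;;) {
        L_sum[next] = sum;            /* "sum" of all nodes before this one */
        if (next == L_next[next])     /* self loop: the tail has its result */
            break;
        sum += L_value[next];
        next = L_next[next];
    }
}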
2.2  Wyllie's algorithm


The first parallel algorithm for list ranking is due to Wyllie [23]. The algorithm uses a technique common
to all parallel list ranking algorithms, "pointer jumping" or "shortcutting". A processor is associated with
each node of the list. Each processor, in parallel, modifies its next pointer L_next to point to its successor's
successor. For each round of pointer jumping the number of list elements that L_next jumps over can double
from the previous iteration. The actual number of elements L_next jumps over is retained in the L_value array.
This array is computed by adding an element's value to its successor's value during each pointer jump.
After ⌈log₂ n⌉ rounds of pointer jumping all elements point to the tail of the list and L_value contains
the distance each node is from the tail of the list. The data parallel version of the inner loop of Wyllie's
algorithm is as follows:

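void Wyllie_loop(int *L_next, int *L_value, int n)
{
    /* L_next  - next link
     * L_value - sum of values of list between self and next
     *
     * Data parallel: each statement reads all of its operands for every i
     * before any result is written back.
     */
    int i, next;

    for (i = 0; i < n; i++) {
        next = L_next[i];
        L_value[i] = L_value[i] + L_value[next];
        L_next[i] = L_next[next];
    }
}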

For each statement we must ensure that all the data are read before they are written back to the same
array. We accomplish this in our vector multiprocessor version by writing to a different array from the one
we are reading. Then, on each call to the inner loop we switch back and forth between the arrays we read from
and the arrays we write to. The simplicity of this algorithm makes it quite attractive. However, there are two
main problems with Wyllie's algorithm.

•  On each iteration the number of nodes of the list that concurrently read the values at the tail doubles.
At the final iteration anywhere from half the nodes to all but one of the nodes may concurrently read
the values at the tail. On Cray X-MP/Cray Y-MP computers concurrent reads are serialized.

•  The algorithm is not work efficient and does log n times as much work as the serial algorithm.
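The read-before-write requirement and the switching between arrays can be sketched in serial C as below. This is an illustrative version, not the paper's Cray code: the names wyllie_round and wyllie_rank_to_tail are ours, and we assume list ranking toward the tail, so every node starts with value 1 except the tail, which starts with 0 (the identity) so that repeated reads of the self-looping tail add nothing.

```c
#include <stdlib.h>
#include <string.h>

/* One round of pointer jumping. Reads only the "in" arrays and writes only
 * the "out" arrays, so no element is overwritten before it has been read. */
static void wyllie_round(const int *next_in, const int *val_in,
                         int *next_out, int *val_out, int n)
{
    for (int i = 0; i < n; i++) {
        int nx = next_in[i];
        val_out[i] = val_in[i] + val_in[nx];
        next_out[i] = next_in[nx];
    }
}

/* rank[i] = number of links from node i to the tail (n >= 1 assumed).
 * Returns 0 on success, -1 if memory cannot be allocated. */
int wyllie_rank_to_tail(const int *link, int n, int *rank)
{
    int *buf = malloc((size_t)4 * n * sizeof *buf);
    if (buf == NULL)
        return -1;
    int *next_a = buf, *next_b = buf + n;
    int *val_a = buf + 2 * n, *val_b = buf + 3 * n;

    for (int i = 0; i < n; i++) {
        next_a[i] = link[i];
        val_a[i] = (link[i] == i) ? 0 : 1;   /* tail holds the identity */
    }
    /* After k rounds each pointer spans up to 2^k links; n - 1 are needed. */
    for (long span = 1; span < n - 1; span *= 2) {
        wyllie_round(next_a, val_a, next_b, val_b, n);
        int *t;
        t = next_a; next_a = next_b; next_b = t;   /* switch read/write roles */
        t = val_a;  val_a = val_b;   val_b = t;
    }
    memcpy(rank, val_a, (size_t)n * sizeof *rank);
    free(buf);
    return 0;
}
```

Every round touches all n elements regardless of how many already point to the tail, which is the source of the extra log n factor of work noted above.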

Figure 3 shows the run times of Wyllie's algorithm on 1 to 8 processors of the Cray C-90. The sawtooth
shape of the curves is due to the addition of another round of pointer jumping whenever ⌈log(n - 1)⌉ changes
value. The negative slope between a pair of teeth is due to the amortization of the additive constant terms
over larger size lists. As you can see from the figure, Wyllie's algorithm quickly degrades in performance
as the list lengths grow. However, it does scale linearly with the number of processors.

Figure 3: The running time per element to perform Wyllie's list ranking algorithm on 1, 2, 4, and 8 processors on
the Cray C-90. Whenever ⌈log(n - 1)⌉ increases by one there is a corresponding jump in the per element running time
of the algorithm, where n is the list length. The implementation on one processor has no overhead due to multitasking
and, hence, performs better on small lists than the multiprocessor version.

2.3  Random mate

One of the simplest work efficient parallel algorithms was devised by Miller and Reif [17, 20]. It uses
randomization to break contention so that processors at neighboring nodes do not attempt to dereference
their successor pointers simultaneously. Once a processor "splices out" a successor node, the processor for
the successor node becomes idle. At each iteration only 1/4 of the remaining nodes are spliced out on average.
After O(log n) rounds all the nodes either point to the tail of the list or have been spliced out. Finally, there
is a reconstruction phase, in which spliced out nodes are reintroduced in the reverse order from which they were
removed. We implemented this algorithm on a single processor of the Cray C-90. Our version removes
idle processors by packing the vectors on every round in order to make the implementation work efficient.
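The contraction and reconstruction phases can be sketched as a serial simulation. The code below is our illustration of the random mate idea, not the implementation timed in the paper. It assumes that every node flips a fair coin each round, that a node splices out its successor only when its own coin is 1 and the successor's is 0, that the tail is never spliced out, and that all n nodes lie on the list. The function name and the use of rand() are ours.

```c
#include <stdlib.h>

/* Exclusive list scan (integer addition) by random-mate contraction.
 * Returns 0 on success, -1 if memory cannot be allocated. */
int random_mate_scan(const int *link, const int *value, int head, int n,
                     int *scan)
{
    int *buf = malloc((size_t)7 * n * sizeof *buf);
    if (buf == NULL)
        return -1;
    int *next = buf, *val = buf + n, *coin = buf + 2 * n, *active = buf + 3 * n;
    int *rem_node = buf + 4 * n, *rem_by = buf + 5 * n, *rem_off = buf + 6 * n;
    int n_active = 0, n_removed = 0, tail = head;

    for (int i = 0; i < n; i++) {
        next[i] = link[i];
        val[i] = value[i];
        coin[i] = 0;
        if (link[i] == i) tail = i;        /* the self loop marks the tail */
        else active[n_active++] = i;       /* the tail never splices */
    }
    while (n_active > 1) {                 /* stop when only the head is left */
        for (int j = 0; j < n_active; j++)
            coin[active[j]] = (rand() >> 8) & 1;
        for (int j = 0; j < n_active; j++) {
            int i = active[j], s = next[i];
            if (coin[i] == 1 && s != tail && coin[s] == 0) {
                rem_node[n_removed] = s;   /* record the splice for later */
                rem_by[n_removed] = i;
                rem_off[n_removed++] = val[i];
                val[i] += val[s];          /* i now covers s's segment too */
                next[i] = next[s];
                coin[s] = -1;              /* mark s as spliced out */
            }
        }
        int kept = 0;                      /* pack: drop spliced-out nodes */
        for (int j = 0; j < n_active; j++)
            if (coin[active[j]] != -1)
                active[kept++] = active[j];
        n_active = kept;
    }
    scan[head] = 0;
    if (tail != head)
        scan[tail] = val[head];
    for (int r = n_removed - 1; r >= 0; r--)   /* reverse order of removal */
        scan[rem_node[r]] = scan[rem_by[r]] + rem_off[r];
    free(buf);
    return 0;
}
```

A node with coin 1 cannot itself be spliced out in the same round, so no two neighboring splices conflict; the result does not depend on the coin flips, only the number of rounds does.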

Anderson and Miller [3, 20] modified the above algorithm so that it avoids load balancing (packing).
Processors are assigned the work of log n nodes. At each round a processor attempts to remove one node in
its queue of nodes. However, in order to splice out its own node, the processor needs reverse link pointers
so that it can get the previous node to jump over the processor's node. If a processor is able to splice
out its node in one round, in the next round it attempts to splice out the next node in its queue. In this
simple way processors remain busy without load balancing being required. After about 4 log n rounds about
O(n / log n) nodes are left, at which point they can be compressed in memory and Wyllie's algorithm can be
applied. Finally, there is a reconstruction phase to reintroduce spliced out nodes. Again only a small constant
proportion (> 1/4) of the processors remove nodes on each round. In our implementation of this algorithm
we did not apply Wyllie's algorithm. We simply stopped processors from attempting to splice out nodes
once they had completed their block of nodes.

Both implementations of the random mate approach are an order of magnitude slower than our algorithm
on one processor, and should be similarly slower on multiple processors, since all the algorithms scale almost
linearly on multiple processors; see Figure 1. They are also slower than the serial implementation on one
processor. Although we did not spend much effort tuning these implementations, we doubt that we could
get more than a factor of two improvement in their running time.
2.4  Our parallel algorithm

Many other work efficient and optimal PRAM algorithms have been developed for list ranking. Most use
contract-rank-expand phases and address two considerations. One is how to find elements on which to work
to keep all the processors busy and the second is how to avoid contention so that two processors are not
working on neighboring list elements [2]. We deal with contention by randomly breaking up the linked list
of length n into m sublists that can be processed independently and in parallel. The list ranking proceeds
in three phases:

Phase 1  Randomly divide the list into m sublists. Reduce each sublist to a single node with value equal to
the "sum" of the values in the sublist. Now the list is of length m.

Phase 2  Find the list scan of the reduced list found in Phase 1. These values are the scan values for the
heads of the sublists.

Phase 3  Expand the nodes in the reduced list back into the original linked list, filling in the scan values
along the list.

Phases 1 and 3 can be done in parallel. The list scan in Phase 2 can be done recursively for large
m, using Wyllie's pointer jumping technique [23] for moderate size m, or serially for small m. For
small m serial list ranking works best because it avoids the overhead associated with multiprocessing and
filling vector pipes (see Figure 1). Wyllie's algorithm performs best on moderate size lists where it can
take advantage of vectorization and multiprocessing and where ⌈log n⌉ is small. For large m we use our
algorithm recursively, until the number of sublists becomes small enough to use either the serial or Wyllie's
algorithm. We determined empirically the size m should be when we switch between algorithms.

There are two problems with our algorithm that make it appear poor theoretically.

•  The sublist lengths vary a great deal, from approximately (n/m) ln(m/(m - 1)) to (n/m) ln m on average,
where ln is log base e. Thus, the processors' work is imbalanced.

•  Since the expected length of the longest sublist is approximately (n/m) ln m, the parallel running time
can be no better than that, i.e., O((n/m) ln m), with m < n / log n. In contrast, there are many parallel
algorithms that have O(n/p + log n) running time.

We ameliorate both problems by requiring m to be much greater than the number of processors, p. In
this way a processor is responsible for several lists, namely m/p. Periodically we perform load balancing
to regroup the lists, which addresses the first problem. If p ≪ m / ln m then the running time is dominated
by n/p and the length of the longest sublist is not a problem.
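The condition on p follows from comparing the two terms that bound the parallel running time. The short derivation below is a sketch that takes the expected longest sublist to be about (n/m) ln m and the evenly divided work per processor to be about n/p; the numerical values of p and m are hypothetical and chosen only to illustrate the inequality.

```latex
\[
  \underbrace{\frac{n}{m}\,\ln m}_{\text{expected longest sublist}}
  \;\ll\;
  \underbrace{\frac{n}{p}}_{\text{work per processor}}
  \quad\Longleftrightarrow\quad
  p \;\ll\; \frac{m}{\ln m}.
\]
% Hypothetical example: p = 8 and m = 1024 give
\[
  \frac{m}{\ln m} \;=\; \frac{1024}{10\,\ln 2} \;\approx\; 148 \;\gg\; 8 .
\]
```

With these illustrative values the longest sublist is a small fraction of each processor's share, so the n/p term dominates.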

The primary advantage of our algorithm is that it is both work efficient and has very small constants.
Overall the algorithm is fully vector parallel, and scales almost linearly with the number of processors.
Figure 4 shows the speedup relative to one processor for various size lists.

Figure 4: Relative speedups of our list ranking algorithm on the Cray C-90.

In the following description we assume that there is one virtual processor for every sublist. Physical
processors are assigned to do the work of an equal number of virtual processors. Because the algorithm is
data parallel the physical processor performs one step on each virtual processor before proceeding to the
next step. The algorithm proceeds in three stages.

Input: The head of a linked list, the linked list, and its associated list values. It is assumed that the linked
list is in contiguous memory and terminates with a self loop.

Output: The list scan of the linked list.

Initialization: Each processor picks a random position in the linked list to be the tail of a sublist. It
saves the position of the tail and the successor link, and sets the tail to a self loop. It then prepares to
find the list scan of the following sublist (which another processor created). It initializes its sublist head as
the successor link saved above and the list sum to zero, where zero is the identity of the list sum operator.
Figure 5 shows the linked list that is the input of the list scan algorithm and the result of the initialization
step.

One processor is responsible for finding the list scan of the first sublist, whose head is the head of the
whole list. This processor is also responsible for creating the (unknown) final sublist. But since the final
sublist already is terminated with a self loop it does nothing to create it. We do not let a processor choose
the tail of the whole list as its random position because it is convenient not to worry about a zero length list
in Phase 2.

It is possible that two processors will pick the same random position at which to break the list. Then
the two processors will duplicate each other's actions and cause contention. We can either use a parallel
algorithm that guarantees to find no duplicate random numbers, such as in [16], or we can remove duplicate
random numbers by having a competition among the processors. Each processor writes its index at its
random location and then, after every processor has written its index, it reads back the index at that location. If
the index is not its own it knows that it is a duplicate processor and can drop out of the computation. The
first approach uses mod arithmetic, which is relatively slow on the Cray, and the second approach may
require a pack, which to do efficiently is quite complicated; see [5].
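The competition among processors can be sketched as a write phase followed by a read phase. The serial C below is an illustration under our own assumptions: a separate scratch array of length n receives the indices (the paper does not say here where the index is written), and because the loop runs serially the last writer wins, whereas on a parallel machine any one of the competing writers may win. The function name is ours.

```c
/* pos[k] is the random position picked by processor k (0 <= pos[k] < n).
 * scratch has n entries and need not be initialized.
 * On return keep[k] is 1 for exactly one processor per distinct position;
 * the function returns the number of processors kept. */
int remove_duplicate_positions(const int *pos, int m, int *scratch, int *keep)
{
    int kept = 0;

    for (int k = 0; k < m; k++)        /* write phase: last writer wins here */
        scratch[pos[k]] = k;
    for (int k = 0; k < m; k++) {      /* read phase, after all writes */
        keep[k] = (scratch[pos[k]] == k);
        kept += keep[k];
    }
    return kept;
}
```

The processors with keep[k] equal to 0 then have to be removed from the vectors, which is the pack step the text refers to.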


Figure 5: At the top of the figure is the initial linked list with its values at each node. At the bottom of the figure are
the results of initialization. The linked list is divided into 3 sublists, each terminating with a self loop. Each processor,
P0, P1, P2, saves two values: its chosen random position, R, and the successor of the random position in the original
linked list, which becomes the head of its sublist, H. Each processor also initializes its sublist sum, S, to zero, the
identity of the scan operator.


Phase 1: Each virtual processor traverses its sublist adding the values along the links to the sum. When
a virtual processor reaches the tail of its sublist it "drops out" of the computation. Every s_i, i = 1, 2, 3, ...,
steps a load balancing step is done, reassigning virtual processors that have not completed their sublist to
the physical processors. Each time they load balance they increment i to get a new s_i. Because of the fairly
predictable sizes of the sublists we can determine what are reasonable values of s_i (see Section 4). Figure 6
shows the status after every processor has dropped out and has found the sum of its sublist. Next the virtual
processors create the reduced list of sublist sums.
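The lockstep traversal with periodic load balancing can be sketched as follows. This serial C is an illustration only: it assumes the sublists have already been terminated with self loops, it tests for the tail on every step for clarity (the paper's inner loops avoid such tests), and it packs after a fixed number of steps given by the hypothetical parameter steps_between_packs rather than the tuned schedule discussed in Section 4. The function name is ours.

```c
/* Reduce m sublists in lockstep.
 * head[k] - head of sublist k; link - successor array with a self loop at
 * every sublist tail; active - scratch array of m entries.
 * On return sum[k] is the "sum" of all values in sublist k and tail[k] is
 * the index of its tail. Returns the number of load balancing steps. */
int phase1_reduce(const int *link, const int *value, const int *head, int m,
                  int steps_between_packs, int *sum, int *tail, int *active)
{
    int n_active = m, packs = 0;

    if (steps_between_packs < 1)
        steps_between_packs = 1;
    for (int k = 0; k < m; k++) {
        active[k] = k;
        tail[k] = head[k];             /* tail[k] doubles as the cursor */
        sum[k] = value[head[k]];
    }
    while (n_active > 0) {
        for (int step = 0; step < steps_between_packs; step++) {
            for (int j = 0; j < n_active; j++) {   /* one step per processor */
                int k = active[j];
                int nx = link[tail[k]];
                if (nx != tail[k]) {   /* not yet at the sublist tail */
                    sum[k] += value[nx];
                    tail[k] = nx;
                }
            }
        }
        int kept = 0;                  /* load balance: pack unfinished ones */
        for (int j = 0; j < n_active; j++) {
            int k = active[j];
            if (link[tail[k]] != tail[k])
                active[kept++] = k;
        }
        n_active = kept;
        packs++;
    }
    return packs;
}
```

Packing too often wastes time on the pack itself, and packing too rarely leaves finished virtual processors idling in the inner loop, which is the trade-off the schedule of step counts is meant to balance.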


Figure 6: The figure shows the results of finding the sum of each sublist. Each processor has traversed its sublist
until it has reached the sublist tail, T, and has accumulated the "sum" of the values along the sublist, S.

At this point each virtual processor has reached the tail of its sublist. It also has the tail of the previous
sublist, which is the random position it chose during initialization. By writing the processor's index into the
tail of the previous sublist and then reading the index at the tail of its own sublist, the processor determines
the index of its successor's sublist. From this index the processor creates a link from its sublist sum to its
successor sublist sum to form a new shorter linked list; see Figure 7.


F1fnm7:  The  figure  shows  finding  the  reduced  list  of  sublisi  sums  during  Phase  I.  Each  priKcssor  writes  its  index 
at  its  random  position  in  the  linked  list.  R.  and  reads  the  index  written  at  the  tail  of  its  sublisi.  /  Tins  index  is  the 
index  of  the  processor  with  the  successor  sublist.  The  tail  sublist  finds  no  index  at  the  tail  of  its  sublist. 


For  example,  consider  the  tail  of  the  first  sublist  in  Figure  7.  The  random  position  for  PnKessor  2  is 
the  tail  of  the  first  sublist.  Processor  2  writes  2  at  that  tail,  ie  the  tail  of  the  sublist  previous  to  Processor  2  s 
sublist.  Then  Processor  0  reads  fhe  index  at  (he  (at/  of  its  own  sublist.  namely  the  first  sublist.  The  v.ilue  is 
2.  the  processor  index  of  its  successor  sublist.  Thus.  Processor  0  links  its  sublist  to  the  sublist  at  Processor 
2, 

The tail of the new linked list corresponds to the tail sublist. A processor can determine whether its
sublist is the tail sublist because no processor wrote its index in the tail. The processor for the tail sublist
can now set the tail of the new reduced linked list to a self loop. The value of each node of the reduced
list is the sublist sum found in Phase 1.
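
The write-then-read step that links the sublists can be sketched in scalar C as follows. This is only an illustration under stated assumptions, not the implementation described in this paper: the names link_sublists, random_pos, tail, mark, and succ are hypothetical, a separate scratch array is used instead of the list's own link field, and -1 stands for "no index written".

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: build the reduced list of sublists (all names are hypothetical).
 * random_pos[p]: node chosen by processor p, i.e. the tail of the previous
 *                sublist; random_pos[0] is unused because sublist 0 starts
 *                at the head of the whole list.
 * tail[p]:       node at which processor p's traversal stopped.
 * mark:          scratch array with one slot per list node.
 * succ[p]:       output, index of the successor sublist; the tail sublist
 *                gets a self loop. */
void link_sublists(size_t n_nodes, size_t n_procs,
                   const size_t *random_pos, const size_t *tail,
                   long *mark, size_t *succ)
{
    assert(n_procs > 0);
    for (size_t v = 0; v < n_nodes; v++)
        mark[v] = -1;                        /* nothing written yet */
    for (size_t p = 1; p < n_procs; p++)     /* all writes finish ...      */
        mark[random_pos[p]] = (long)p;
    for (size_t p = 0; p < n_procs; p++) {   /* ... before any read starts */
        long m = mark[tail[p]];
        succ[p] = (m < 0) ? p : (size_t)m;   /* no index found: tail sublist */
    }
}
```

With three sublists whose tails are nodes 1, 5, and 3, and with Processor 1 and Processor 2 having chosen nodes 3 and 1, Processor 0 reads index 2 at its tail, as in the example above, and Processor 1 finds no index and becomes the tail of the reduced list.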

Phase  2:  Depending  on  the  size  of  this  new  linked  list  the  algorithm  finds  the  scan  of  the  reduced 
linked  list  recursively,  using  Wyllie's  algorithm,  or  serially.  Figure  8  shows  the  result  of  this  phase  of  the 
algorithm. 
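
For the serial case of Phase 2, the scan can be sketched as a single walk along the reduced list. The sketch below assumes an exclusive scan under integer addition and a list that ends in a self loop; the function name and argument order are illustrative and are not taken from the implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: serial exclusive list scan with integer addition as the operator.
 * next[v] is the successor of node v; the tail node points to itself.
 * On return scan[v] holds the sum of the values of all nodes before v. */
void serial_list_scan(long *scan, const size_t *next, const long *value,
                      size_t head)
{
    assert(scan != NULL && next != NULL && value != NULL);
    size_t v = head;
    long running = 0;              /* identity of the scan operator */
    for (;;) {
        scan[v] = running;
        if (next[v] == v)          /* self loop marks the tail */
            break;
        running += value[v];
        v = next[v];
    }
}
```

Applied to a reduced list with links 0 -> 2 -> 1 and sublist sums 5, 7, and 3 for sublists 0, 1, and 2, the scan values are 0 for sublist 0, 5 for sublist 2, and 8 for sublist 1.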


Figure 8: The list scan on the reduced list of sublist sums.

Phase 3: Phase 3 starts with the scan value found in Phase 2 as the scan value for the head of each sublist;
see Figure 9. Each virtual processor finds the scan of the remaining nodes in its sublist in the same manner
as in Phase 1. It traverses its sublist, setting the scan of each node to the sum of the scan and value of the
previous node. Again, after $s_i$ steps, $i = 1, 2, 3, \ldots$, load balancing is done.


Figure 9: The scans of the reduced list found in Phase 2 are the scan values for the heads of the sublists.

Restoration: Finally each virtual processor reconnects the sublists to form the original linked list, using
the values saved during initialization. That is, each processor except Processor 0 replaces the self loop at its
random position with a link to its sublist head. Figure 10 shows the final scan values and the restored linked
list that is the result of the completed algorithm.


Figure 10: The resulting scan values of the linked list found in Phase 3. The links at the tails of the sublists, R, are
replaced by links to the sublist heads, H, to restore the linked list to its original form.


3  Vector  Implementation  of  List  Scan 

We implemented our list scan algorithm on a Cray C-90, a vector multiprocessor. Vector multiprocessor
machines consist of multiple scalar processors, each augmented with a bank of vector registers, pipelined
functional units, and vector instructions. The functional units divide their operation into several stages so
that the clock speed can be increased. On every clock cycle another element from the vector register enters
the pipeline of the functional unit while one result exits the pipeline. The delay between the time the first
operands enter the pipeline and the time the first result leaves it is called the latency or start-up time of the
functional unit. If the functional units are fully pipelined, they can accept new operands every clock cycle.
Multiple functional units can process data simultaneously and, if the hardware permits, the results of one
functional unit can be chained directly to the input of another unit. Thus, executing an operation on operands
of length n on a fully pipelined functional unit with start-up time s takes s + n clock cycles, where $n \le r$,
the vector register length. To hide the latency we want n as close to r as possible.
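
The cost model above can be written down directly. The sketch below extends it to operands longer than one register by charging the start-up time once per strip of at most r elements; that extension, the function name, and the sample start-up time of 10 cycles are assumptions made for illustration, not measurements.

```c
#include <assert.h>

/* Sketch: clock cycles to process n operands on a fully pipelined functional
 * unit with start-up time s, when a vector register holds r elements.
 * Each strip of len <= r elements is charged s + len cycles. */
unsigned long pipeline_cycles(unsigned long n, unsigned long r, unsigned long s)
{
    assert(r > 0);
    unsigned long cycles = 0;
    while (n > 0) {
        unsigned long len = (n < r) ? n : r;
        cycles += s + len;
        n -= len;
    }
    return cycles;
}
```

With s = 10, processing 256 operands costs 276 cycles when r = 128 but 296 cycles when r = 64, which is why strips as close to the register length as possible hide the latency best.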

The processors of a vector multiprocessor are typically connected through a multistage interconnection
network to a common memory. Memory is composed of multiple memory banks that can access different
addresses in parallel using a single global address space. Once a memory bank has been accessed it cannot
be accessed again until a delay, called the cycle time, has elapsed. Usually the memory is optimized for
sequential access. That is, banks are fully interleaved so that successive addresses are on successive memory
banks and the number of memory banks is greater than the cycle time. In this way, memory can be accessed
sequentially at one element per clock cycle. Vectors can be loaded and stored sequentially (stride = 1) or at
every k-th element (stride = k) of memory. Bad choices for k can result in the same memory bank being
accessed again before its cycle time has elapsed; a memory-bank conflict then occurs, causing memory stalls.
Memory can also be accessed at arbitrary locations using an index vector. Often such loads are called gather
operations and such stores are called scatter operations. Because of memory-bank conflicts, each element
during a gather or scatter typically is accessed at a rate lower than one per machine clock cycle (about 2 clock
cycles/element for random access patterns on the Cray Y-MP machines). In addition to cycle time delays
there are access latency delays. The latency for a load is the time to get the first word from memory to the
register, and often is much greater than the latency of the functional units. However, for Cray Y-MP machines
memory access latencies are about the same, but are getting longer as the machines get bigger.
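
Why some strides are bad can be seen from a simple model of interleaved memory in which address a lives on bank a mod B. The sketch below assumes that model and one access issued per clock; the function names and the bank counts in the usage note are hypothetical and are not Cray specifications.

```c
#include <assert.h>

static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks visited by a stride-k access pattern when memory
 * is interleaved over `banks` banks. */
unsigned banks_touched(unsigned stride, unsigned banks)
{
    assert(stride > 0 && banks > 0);
    return banks / gcd_u(stride, banks);
}

/* Nonzero if some bank is revisited before its cycle time (in clock cycles)
 * has elapsed, assuming one access is issued per clock. */
int stride_conflicts(unsigned stride, unsigned banks, unsigned bank_cycle)
{
    return banks_touched(stride, banks) < bank_cycle;
}
```

For a hypothetical memory with 64 banks and a cycle time of 4 clocks, stride 1 and stride 3 visit all 64 banks and never stall, whereas stride 32 alternates between only 2 banks and stalls.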

When we implement a PRAM algorithm on a vector multiprocessor, we treat each element in a vector
register as an element processor of a SIMD machine. We call the vector element an element processor to
distinguish it from a full vector processor. Any data-parallel algorithm can be vectorized and parallelized
by having an element processor do the work of a virtual processor in the algorithm. For example, on a
Cray C-90 each processor has a vector-register length of 128 and there are 16 processors. Therefore, we
can have as many as 2048 element processors. However, by using strip-mining [19] or loop-raking [25, 5]
we can assign the work of several virtual processors to a single element processor.
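
Strip-mining can be sketched as follows: the virtual processors are cut into strips of at most one register length, and within a strip element e of the register does the work of virtual processor base + e. The function name and the element-wise addition used as the per-processor work are placeholders chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define VL 128   /* vector register length of the Cray C-90, as stated above */

/* Sketch: y[i] += x[i] for n_virtual virtual processors, processed one strip
 * of at most VL elements at a time. Returns the number of strips used. */
size_t strip_mined_add(long *y, const long *x, size_t n_virtual)
{
    assert(x != NULL && y != NULL);
    size_t strips = 0;
    for (size_t base = 0; base < n_virtual; base += VL) {
        size_t len = (n_virtual - base < VL) ? n_virtual - base : VL;
        for (size_t e = 0; e < len; e++)   /* e is the element processor */
            y[base + e] += x[base + e];
        strips++;
    }
    return strips;
}
```

With 300 virtual processors the loop runs three strips of lengths 128, 128, and 44, so each element processor serves up to three virtual processors.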

We implemented our list scan algorithm using C and the standard Cray C compiler on a Cray C-90.
Because many of our vector operations use indirect addressing we needed to give compiler directives in
order to get the compiler to vectorize the loops. The only portion that is not vectorizable is the serial list
scan in Phase 2. All loops we present in this section can be vectorized. In the actual implementation, we
attempted to reorder the statements within a loop in order to fill the multiple functional units for concurrent
operations, to avoid contention between input/output memory ports and the gather/scatter hardware, and to
avoid write-after-read dependencies [12]. Chaining is also possible within loops. Because memory access
is dependent on the data there is nothing we could do to avoid memory bank conflicts, except possibly
randomizing the input. However, since we are choosing random positions for the heads of the sublists,
systematic memory bank conflicts are unlikely.

In this section we give pseudo C code to illustrate our implementation of the single vector processor
version. In Section 5 we show how we modified this algorithm for a vector multiprocessor machine. Below
we use three structures containing sets of vectors to simplify the presentation. However, in our actual
implementation we use individual vectors. For each subroutine we discuss the vectorization, develop an
equation for estimating the execution time, and present the C pseudo code.

List_Scan: We treat the linked list ll as a pair of vectors, where one vector ll.next gives the indices of the
successive nodes in the linked list and the other vector ll.value gives the values of the nodes. The scalar
ll.head is the index of the head of the linked list and the scalar ll.n is the length of the linked list. We assume
that the linked list terminates with a self loop. The resulting list scan will be stored in the vector ll.sum.

We use another set of vectors vp that represent the virtual processors, which we periodically pack as
processors drop out. The vector vp.next gives the index of the next successor in each sublist, vp.sum the
current "sum" of each sublist, and vp.proc_id the virtual processor id. The scalar vp.n is the number of
currently active virtual processors.

In order to avoid having to check whether a processor has reached the end of a sublist at every pointer
dereference we modify the parallel algorithm described in the previous section. During initialization, at the
tail of each sublist we destructively set ll.next to its own index to create a self loop and set ll.value to zero,
where zero is the identity value of the scan operator. In this way, we can repeatedly add the tail value to the
sublist sum without affecting the sum.
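
The effect of the zero-valued self loop can be seen in a small scalar sketch. The function name traverse and its signature are hypothetical; the point is only that the loop body contains no end-of-sublist test and that overshooting the tail is harmless.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: advance one sublist by n_steps links with no end-of-sublist test.
 * Relies on the tail being a self loop whose value is 0, the identity of +.
 * *pos is the current node; the updated sum is returned. */
long traverse(const size_t *next, const long *value, size_t *pos,
              long sum, size_t n_steps)
{
    assert(pos != NULL);
    for (size_t j = 0; j < n_steps; j++) {
        sum += value[*pos];
        *pos = next[*pos];
    }
    return sum;
}
```

For a sublist 0 -> 1 -> 2 with values 4, 6, and a zeroed tail, two steps give the sum 10 and leave the position at the tail; ten further steps leave both the sum and the position unchanged.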

Finally, we use a set of vectors sl to save information about the sublists. So that we can restore ll before
returning from List_Scan we save the random indices of the tails of sublists in sl.random, the values of the
tails in sl.value, and the successor links at the tails, namely the heads of the sublists, in sl.head. During the
course of the algorithm we save intermediate results: sl.tail, the index of the tail of each sublist; sl.sum, the
sum of each sublist; and sl.next, the index of the successor sublist.

List_Scan starts by calling Initialize, which sets up the sublists and returns the number of sublists
created. In Phase 1 List_Scan alternates between traversing each sublist and packing out completed sublists
until no sublists remain. Initial_Rank traverses s[i] links of each sublist, summing the values at each node
(in Section 4.3 we discuss how we determine the values of s[i]). Then Initial_Pack load balances the
remaining lists by removing the completed sublists from vp. It saves the results of the completed sublists in
sl and removes them from vp by packing the remaining sublists to the initial portion of the arrays. By packing
the arrays we effectively reassign virtual processors to element processors. Finally, Initial_Pack returns
the number of incomplete sublists remaining. After all the sublists have completed, Find_Sublist_List
forms a linked list, sl.next, of the sublist sums, sl.sum. In Phase 2 List_Scan finds the list scan of this linked
list by calling either List_Scan recursively, the vector multiprocessor routine Wyllie, or the serial routine
Serial_List_Scan. The results are put in the virtual processor array vp.sum. Phase 3 is like Phase 1 and
proceeds by alternating between traversing the sublists for s[i] nodes and packing out finished sublists until
no sublists remain. Finally, Restore_List puts back the original links and values at the tails of the sublists.
All the routines are vectorized except Serial_List_Scan.

List_Scan(ll.sum, ll)
{
    /* Phase 1 */
    i = 0;                                  /* Initialize the pack counter */
    vp.n = sl.n = Initialize(vp, sl, ll);   /* Initialize the virtual processors */
    while (vp.n > 0) {                      /* Find the sublist sums */
        Initial_Rank(vp, ll, s[i++]);
        vp.n = Initial_Pack(vp, sl, ll);
    }
    Find_Sublist_List(ll, sl);              /* Turn the sums into a list */

    /* Phase 2 */
    if (sl.n > wyllie_cutoff)
        List_Scan(vp.sum, sl);              /* Recursive */
    else if (sl.n > serial_cutoff)
        Wyllie(vp.sum, sl);
    else
        Serial_List_Scan(vp.sum, sl.next, sl.sum, 0);

    /* Phase 3 */
    i = 0;                                  /* Reset the pack counter */
    vp.n = sl.n;                            /* Reset number of virtual processors */
    while (vp.n > 0) {                      /* Find the scan of the sublists */
        Final_Rank(ll, vp, s[i++]);
        vp.n = Final_Pack(vp, sl, ll);
    }
    Restore_List(sl, ll);                   /* Restore values at sublist tails */
}

Initialize: Initialization starts by finding sl.n, the appropriate number of sublists to use given the length of
the linked list. In Section 4.3 we discuss how to determine an appropriate value for sl.n. Gen_Tails
finds sl.n pseudo-random positions in the linked list, sl.random, which are to be the tails of the sublists.
It also ensures that none of these positions is the tail of the whole list. To simplify the implementation
we chose to use equally spaced positions and assumed that the linked lists are randomly ordered. If the
ordering of the links is random then we can expect sublist lengths to follow the same distribution as when
the heads of the lists are chosen randomly. Next Initialize saves the links and values at the tails. The
head of the first sublist is the head of the whole linked list. Because the links are retrieved from random
positions, retrieving ll.next at sl.random requires a load and a gather, and saving it in sl.head requires a store.
Then Initialize gathers ll.value at sl.random and stores the values in sl.value.

Next Initialize turns the linked list into a set of sublists by setting the tails to self loops and the tail values
to zero. As the tails are at random positions these assignments require two scatter operations. We need not
worry about the value of the tail of the whole list in Phase 1 because we do not need the correct sum for the
tail sublist when finding the scan in Phase 2. Finally, Initialize initializes the virtual processor vectors: it
stores the heads, stores zeros in the sums, and stores the processor ids.

Initialize(vp, sl, ll)
{
    sl.n = Compute_Num_Sublists(ll.n);
    Gen_Tails(sl.random, sl.n, ll.n);           /* Find random positions */
    sl.head[0] = ll.head;                       /* Set head of first sublist */
    for (i = 1; i < sl.n; i++) {                /* Save tails of sublists */
        sl.head[i] = ll.next[sl.random[i]];     /* Gather heads and save */
        sl.value[i] = ll.value[sl.random[i]];   /* Gather values at tails and save */
                                                /* Set up sublists */
        ll.value[sl.random[i]] = ZERO;          /* Scatter zero at tails values */
        ll.next[sl.random[i]] = sl.random[i];   /* Scatter self loops at tails */
    }
    for (i = 0; i < sl.n; i++) {                /* Initialize virtual processors */
        vp.next[i] = sl.head[i];                /* Store heads */
        vp.sum[i] = ZERO;                       /* Initialize sums */
        vp.proc_id[i] = i;                      /* Assign processor id's */
    }
    return sl.n;                                /* Number of sublists created */
}

The time for Initialize in Cray C-90 clock cycles (4.2 nsec) is

$$T_{\text{Initialize}}(m) = 13m + 8700,$$

where m is the number of sublists.

Initial_Rank: Initial_Rank traverses each sublist for n_steps steps, computing the sum of the values along
the links. Because the value of the tail is zero, it can repeatedly accumulate the sum at the tail without
affecting the sum. Each traversal of the vector of links requires retrieving the values and links at arbitrary
locations in ll. Thus, it uses two gather operations. To increment the sum requires loading, adding to, and
storing vp.sum. Finally it needs to store the current link vp.next.




Initial_Rank(vp, ll, n_steps)
{
    for (j = 0; j < n_steps; j++)                /* Dereference n_steps times on each sublist */
        for (i = 0; i < vp.n; i++) {
            vp.sum[i] += ll.value[vp.next[i]];   /* Gather value and increment sum */
            vp.next[i] = ll.next[vp.next[i]];    /* Gather successor link */
        }
}

The time for the inner loop of Initial_Rank is

$$T_{\text{Initial\_Rank step}}(x) = 3.4x + 80,$$

where x is the vector length of vp.

Strip mining currently is performed on the inner loop. However, it would be more efficient to do the strip
mining of the inner loop outside the outer loop. In this way, we would not need to load and store intermediate
results between successive iterations of the outer loop. The only way to get this effect on the Cray is to
unroll the inner loop. We did not do this optimization.
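
The loop order that this paragraph describes can be sketched in portable scalar C. This is not something the implementation does, and it says nothing about how the Cray compiler would treat such code; the function name, the flattened argument list, and the local arrays standing in for vector registers are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define VL 128   /* strip length; the C-90 vector register length given earlier */

/* Sketch: the ranking step with the strip loop outermost, so that each
 * strip's sums and links stay in local storage for all n_steps and are
 * loaded and stored only once per strip. */
void initial_rank_strips(long *vp_sum, size_t *vp_next, size_t vp_n,
                         const long *ll_value, const size_t *ll_next,
                         size_t n_steps)
{
    assert(vp_sum != NULL && vp_next != NULL);
    for (size_t base = 0; base < vp_n; base += VL) {
        size_t len = (vp_n - base < VL) ? vp_n - base : VL;
        long sum[VL];
        size_t nxt[VL];
        for (size_t e = 0; e < len; e++) {       /* load once per strip */
            sum[e] = vp_sum[base + e];
            nxt[e] = vp_next[base + e];
        }
        for (size_t j = 0; j < n_steps; j++)
            for (size_t e = 0; e < len; e++) {
                sum[e] += ll_value[nxt[e]];      /* gather value */
                nxt[e] = ll_next[nxt[e]];        /* gather successor link */
            }
        for (size_t e = 0; e < len; e++) {       /* store once per strip */
            vp_sum[base + e] = sum[e];
            vp_next[base + e] = nxt[e];
        }
    }
}
```

The result is the same as with the original loop order; only the number of loads and stores of vp.sum and vp.next changes, from once per step to once per strip.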

Initial_Pack: After traversing the sublists for s[i] steps, List_Scan packs out any completed lists. Packing
requires saving the results of the completed lists according to their processor ids and then packing the
remaining lists in vp so that they are contiguous. A sublist is complete if the virtual processor has reached
the tail of the sublist, which is a self loop. To test for a self loop requires loading vp.next, gathering ll.next
at vp.next, and testing whether the two are equal. There are two ways we could get the compiler to save the
completed lists. One is to compute the indices of the completed lists and then use these indices to gather
vp.sum, vp.next, and vp.proc_id. Then it can scatter vp.sum to sl.sum and vp.next to sl.tail at the indices in
vp.proc_id. The other way is to load vp.sum, vp.next, and vp.proc_id, and use a vector merge operation to
change them so that all active sublists take the values of one of the completed sublists. Then, as before, it
can scatter vp.sum to sl.sum and vp.next to sl.tail at the indices in vp.proc_id. The effect is to have all active
sublists write to the same location in sl.sum and sl.tail. This approach causes much memory contention
because most of the sublists are active. Clearly, the former approach is better and is what we used.


Initial_Pack(vp, sl, ll)
{
    j = 0;
    for (i = 0; i < vp.n; i++) {
        if (vp.next[i] == ll.next[vp.next[i]]) {   /* Save completed sublists */
            sl.sum[vp.proc_id[i]] = vp.sum[i];     /* Gather and scatter sum */
            sl.tail[vp.proc_id[i]] = vp.next[i];   /* Gather and scatter tail */
        } else {                                   /* Pack remaining sublists */
            vp.proc_id[j] = vp.proc_id[i];         /* Gather and store processor id's */
            vp.sum[j] = vp.sum[i];                 /* Gather and store current sum */
            vp.next[j++] = vp.next[i];             /* Gather and store current link */
        }
    }
    return j;                                      /* Number of remaining sublists */
}


Packing the remaining active sublists is done by computing the indices of the active sublists and, for
each vector, vp.next, vp.sum, and vp.proc_id, gathering the vector using the active indices and then storing
it contiguously. The time for one application of Initial_Pack for vectors of length x is

$$T_{\text{Initial\_Pack step}}(x) = [\ldots]\,x + 540.$$


Find_Sublist_List: At this point each sublist has reached its tail and is ready to start Phase 2. Recall
that sl.tail holds the tail of each sublist, while sl.random holds the tail of the previous sublist. Therefore,
when Find_Sublist_List writes the sublist index to ll.next at sl.random, it is writing the index of the
successor sublist to the tails of the sublists. (We write to ll.next because we can easily regenerate the self
loops there.) This write requires loading sl.random and then scattering the index to ll.next at sl.random.
Note that sl.random does not contain the index of the tail of one sublist, namely the tail of the whole list.
Therefore, if it writes the negative index it can distinguish between values set at sl.random and the original
self loops in ll.next. Next Find_Sublist_List gathers these indices from ll.next, but this time using sl.tail.
These indices are the indices of the successor sublists as long as they are negative. Only one index is
not negative, and it is the index of the tail of the whole list. Notice that the writing and reading of the indices
are done in separate loops because the reading of the indices may not be done in the same order as the writing.
That is, we need to be sure that the write is complete before the read starts and that no chaining is allowed.


Find_Sublist_List(ll, sl)
{
    for (i = 1; i < sl.n; i++)
        ll.next[sl.random[i]] = -i;          /* Scatter index of next sublist */
    for (i = 0; i < sl.n; i++) {             /* Create list of sublists */
        next = ll.next[sl.tail[i]];          /* Gather index of next sublist */
        sl.next[i] = -next;                  /* Store the index */
        if (next >= 0) {                     /* Found tail of whole list (>= so that a tail at index 0 is detected) */
            sl.next[i] = i;                  /* Set tail sublist to self loop */
            sl.random[0] = next;             /* Save tail of whole list */
            sl.value[0] = ll.value[next];    /* Save its value */
            ll.value[next] = ZERO;           /* Set tail value to ZERO */
        }
    }
    for (i = 0; i < sl.n; i++) {
        ll.next[sl.tail[i]] = sl.tail[i];    /* Scatter self loops at tails */
        sl.sum[i] += sl.value[sl.next[i]];   /* Gather tail values and increment sum */
        vp.next[i] = sl.head[i];             /* Reset virtual processor heads */
    }
}


Once Find_Sublist_List finds the tail sublist it sets the tail of sl.next to a self loop, saves the tail of the
whole list and its value in sl.random[0] and sl.value[0], and sets the value of the tail of the whole list to zero.
Note that it was not necessary to set the value of the tail of the whole list to zero for Phase 1 because we
do not need the correct sum of the tail sublist to find the scan of the reduced list, but in Phase 3 we need the
tail value set to zero because, otherwise, we may repeatedly add the tail value to the scan at the tail.

Finally, Find_Sublist_List returns the tails of the sublists to self loops, which requires loading sl.tail
and scattering it to ll.next. Since the tail values were never added to the sublist sums during Initial_Rank,
it next adds the tail values to the sublist sums. The tail values were saved in sl.value during Initialize by
the successor sublist and therefore must be indexed by sl.next. It loads sl.sum, gathers sl.value using sl.next,
and adds the values to the sums, which are then stored. Lastly, it reinitializes the virtual processor heads in
anticipation of Phase 3. Reinitializing the heads requires loading sl.head and storing it in vp.next.

The time for Find_Sublist_List is

$$T_{\text{Find\_Sublist\_List}}(m) = [\ldots]\,m + 770,$$

where m is the number of sublists.

Scan of the Reduced List: Next List_Scan finds the list scan on the sublist sums sl.sum using the
list sl.next. If the list is large then it finds the scan recursively. If the list length lies between the recursive
cutoff and the serial cutoff it uses Wyllie. If the list length is small it uses Serial_List_Scan. The time for
Serial_List_Scan is:

$$T_{\text{Serial\_List\_Scan}}(m) = [\ldots].1\,m + 255,$$

where m is the number of sublists.

Final_Rank: In Phase 3 List_Scan repeatedly calls Final_Rank to traverse the sublists for n_steps steps.
The scan of each sublist is found in the same manner as in Phase 1. The only difference is that Final_Rank
scatters the resulting scan vp.sum to ll.sum at the current positions in vp.next.

Final_Rank(ll, vp, n_steps)
{
    for (j = 0; j < n_steps; j++)
        for (i = 0; i < vp.n; i++) {
            ll.sum[vp.next[i]] = vp.sum[i];      /* Load and scatter sums */
            vp.sum[i] += ll.value[vp.next[i]];   /* Gather value and increment sum */
            vp.next[i] = ll.next[vp.next[i]];    /* Gather successor link and store */
        }
}

The time for one iteration of the inner loop of Final_Rank is:

$$T_{\text{Final\_Rank step}}(x) = [\ldots]\,x + [\ldots],$$

where x is the number of sublists remaining.

Final_Pack: The packing step in Phase 3 is a little simpler than the packing step in Phase 1. Only the
completed lists need to write their sums to ll.sum because the active sublists will write their sums on the
next call to Final_Rank. However, it is faster to simply load all of vp.sum and scatter to ll.sum than it is
to compute the indices of the completed sublists, gather vp.sum, and scatter it to ll.sum. For active
sublists, Final_Pack packs vp.sum and vp.next as in Initial_Pack. However, it does not need to keep track
of vp.proc_id.

Final_Pack(vp, sl, ll)
{
    j = 0;
    for (i = 0; i < vp.n; i++) {
        ll.sum[vp.next[i]] = vp.sum[i];              /* Load and scatter sums */
        if (vp.next[i] != ll.next[vp.next[i]]) {     /* Pack remaining sublists */
            vp.sum[j] = vp.sum[i];                   /* Gather and store sums */
            vp.next[j++] = vp.next[i];               /* Gather and store links */
        }
    }
    return j;
}

The time for Final_Pack is:

$$T_{\text{Final\_Pack step}}(x) = [\ldots]\,x + 400,$$

where x is the number of sublists to be packed.

Restore_List: Finally each processor restores the original links and values at the sublist tails. This requires
loading sl.random, sl.head, and sl.value and scattering to ll.next and ll.value using sl.random. Because the
tail of the whole list is supposed to be a self loop anyway, it does not set ll.next at the tail.

Restore_List(sl, ll)
{
    ll.value[sl.random[0]] = sl.value[0];        /* Reset value of list tail */
    for (i = 1; i < sl.n; i++) {
        ll.next[sl.random[i]] = sl.head[i];      /* Scatter links at tails */
        ll.value[sl.random[i]] = sl.value[i];    /* Scatter values at tails */
    }
}

The time for Restore_List is

$$T_{\text{Restore\_List}}(m) = [\ldots]\,m + 50,$$

where m is the number of sublists.

4  Analysis  of  the  Algorithm 

In Phases 1 and 3 of the algorithm we periodically perform load balancing so that processors that have
finished their sublists are removed from the computation. We would like to pack as soon as there are several
finished sublists. However, if we pack too frequently we pack none or only a few sublists, and when there
are many sublists packing is expensive. If we do not pack often enough, we may have many processors
performing needless work, repeatedly chasing the sublists' tails. In order to determine good times
to pack we first need a better understanding of the expected distribution of the sublist lengths.
We find an estimate of the distribution in Section 4.1. Next, in Section 4.2 we determine the overall
cost of performing the algorithm, given the timing data in Section 3. In Section 4.3, given n, the length
of the linked list, m, the number of sublists, and $s_1$, the number of ranking steps to perform before the first
pack, we determine how to minimize the costs of the packs and the unnecessary tail chasing. In Section 4.4
we discuss how to find m and $s_1$ given n. Finally, we summarize the costs by giving an estimate of the
overall performance of the algorithm and comparing it with the actual performance. The main theorem of
this section is:

Theorem 1 The list ranking algorithm in this paper has expected time $O(n/p + (n \ln m)/m)$ on p processors,
when $m < n / \log n$.

4.1 Analysis of sublist lengths

In this section we show that the distribution of the lengths of the sublists is approximately a negative
exponential distribution when n and m are large. The analysis is from Feller [10]. We first consider the
following situation. Let $X_1, \ldots, X_m$ be m random numbers in the range (0, 1). For truly random numbers
$\mathrm{Prob}\{X_i = X_j\} = 0$ for $i \ne j$. Therefore, the numbers partition (0, 1) into m + 1 subintervals. Let
$X_{(1)}, \ldots, X_{(m)}$ denote the X's ordered by size from smallest to largest.

Proposition 2 (Feller [10]) If $X_1, \ldots, X_m$ are independent and uniformly distributed over the range (0, 1),
then as $m \to \infty$ the successive intervals in our partition behave as though they are mutually independent
exponentially distributed variables with expectation $1/m$.

Lemma 3 If $X_1, \ldots, X_m$ are independent and uniformly distributed over the range (0, 1) then

$$\mathrm{Prob}\{X_{(1)} > t/m\} = \left(1 - \frac{t}{m}\right)^m.$$

Proof: The length of $(0, X_{(1)})$ exceeds $t/m$ iff all of $X_1, \ldots, X_m$ are in the interval $(t/m, 1)$. Because the events
are independent and uniformly distributed, the probability of the combined events happening is $(1 - t/m)^m$. As
$m \to \infty$ this probability tends to $e^{-t}$. Thus, the distribution of the first interval in the limit is a negative
exponential with mean $m^{-1}$. □

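Lemma 4 (without proof) If $X_1, \ldots, X_k$ are mutually independent and follow an exponential distribution with
expected value $\mu^{-1}$, then $X_1 + \cdots + X_k$ follows a gamma distribution

$$G_{\mu,k}(x) = \mathrm{Prob}\{X_1 + \cdots + X_k \le x\} = 1 - e^{-\mu x} \sum_{i=0}^{k-1} \frac{(\mu x)^i}{i!}.$$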

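Lemma 5 If $X_1, \ldots, X_m$ are independent and uniformly distributed over the range (0, 1) then for every
fixed k, as $m \to \infty$,

$$\mathrm{Prob}\{X_{(k)} > t/m\} \to e^{-t} \sum_{j=0}^{k-1} \frac{t^j}{j!},$$

which is the tail of the gamma distribution $G_{m,k}$.

Proof: In order for $X_{(k)} > t/m$, fewer than k of the X's may lie in the range $(0, t/m)$. Because the events
are independent and uniformly distributed, the number of X's that lie in the range $(0, t/m)$
follows a binomial distribution with probability of success equal to $t/m$ and probability of failure equal to
$1 - t/m$. That is, the probability that exactly j of the X's lie in the range is

$$\binom{m}{j} \left(\frac{t}{m}\right)^j \left(1 - \frac{t}{m}\right)^{m-j}
  = \frac{m(m-1)\cdots(m-j+1)}{m^j} \, \frac{t^j}{j!} \left(1 - \frac{t}{m}\right)^{m-j},$$

which tends to $e^{-t} t^j / j!$ as $m \to \infty$. To obtain $\mathrm{Prob}\{X_{(k)} > t/m\}$ we need to sum over the
range $j = 0, \ldots, k - 1$. □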

Proof (of Proposition):

$X_{(k)}$ is the sum of the first k intervals, $X_{(1)}, X_{(2)} - X_{(1)}, \ldots, X_{(k)} - X_{(k-1)}$. In the limit as $m \to \infty$,
$X_{(k)}$ follows a gamma distribution with parameters m, k, which is the distribution of the sum of k mutually
independent exponential variates with expectation $1/m$. Therefore, the successive intervals in the limit behave
as though they are mutually independent exponential variates. □

Returning to the distribution of sublist lengths: in our case, we are choosing m random positions in
a list of length n. Because these are random positions we can assume the list is laid out in order from
left to right. Then the lengths of the sublists are the intervals determined by m random integral numbers
$Y_1, \ldots, Y_m$ from 0 to n - 1. Let

$$L_0 = Y_{(1)}, \qquad L_i = Y_{(i+1)} - Y_{(i)}, \qquad L_m = n - Y_{(m)}.$$

If $n > m$, $n \to \infty$, and $m \to \infty$ then the lengths of the sublists tend to behave as mutually independent
exponential variates with expectation $n/m$. That is, if L is a sublist length, then

$$\mathrm{Prob}\{L > x\} \approx e^{-mx/n} = a.$$


If we let $a = (m + .5)/(m + 1)$ and solve for $x$ we get an estimate of the expected length of the shortest sublist of $m + 1$ sublists, namely

$$\mathrm{Exp}(L_{(0)}) \approx \frac{n}{m} \ln\left(\frac{m + 1}{m + .5}\right).$$


If we let $a = .5/(m + 1)$ and solve for $x$ we get an estimate of the expected length of the longest sublist, namely

$$\mathrm{Exp}(L_{(m)}) \approx \frac{n}{m} \ln(2(m + 1)).$$

In general, we can estimate the expected length of the $i^{th}$ smallest sublist by setting $a = (m - i + .5)/(m + 1)$ and solving for $x$. This estimate seems to be reasonable for $n$ and $m$ as small as $n > 1000$ and $m > 100$ for all but the smallest sublist. A better estimate for the smallest sublist seems to be

$$\mathrm{Exp}(L_{(0)}) = \frac{n}{m} \ln\left(\frac{m + 1}{m}\right).$$

Figure 11 shows the expected length of the $i^{th}$ sublist for several values of $m$ when $n = 10000$ and compares it to some actual data averaged over 20 samples. Notice that as $m$ increases the expected length of the longest sublist decreases and there is less variation in list lengths. Therefore, to reduce the parallel running time we want to make $m$ large. However, as $m$ increases the costs due to packs, initialization, and Phase 2 increase.

Figure 11: The function curves are the expected length of the $i^{th}$ shortest sublist when $n = 10000$ for several values of $m$, the number of sublists. The observed lengths were found by taking 20 samples of dividing a list of size $n = 10000$ into $m$ sublists and, for the collection of the $i^{th}$ shortest sublist of each sample, finding the average length (shown with a data symbol) and the minimum and maximum lengths (shown with an error bar).
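
The estimates of this section can be checked numerically. The following is a minimal sketch, not the code used to produce Figure 11: the function names are hypothetical, and the sampler draws distinct cut positions, which is an assumption of the sketch. It evaluates the estimate obtained by solving $e^{-mx/n} = a$ with $a = (m - i + .5)/(m + 1)$ and compares it with averages over random cuts.

```python
import math
import random

def expected_ith_shortest(n, m, i):
    """Estimated expected length of the i-th shortest of m + 1 sublists
    (i = 0 is the shortest, i = m the longest): solve exp(-m*x/n) = a
    for x, with a = (m - i + 0.5) / (m + 1)."""
    a = (m - i + 0.5) / (m + 1)
    return -(n / m) * math.log(a)

def sample_sorted_sublist_lengths(n, m, rng):
    """Cut a list of n elements at m distinct random positions and return
    the m + 1 interval lengths in increasing order. Distinct positions are
    an assumption of this sketch, so every sublist is non-empty."""
    cuts = sorted(rng.sample(range(1, n), m))
    bounds = [0] + cuts + [n]
    return sorted(hi - lo for lo, hi in zip(bounds, bounds[1:]))

if __name__ == "__main__":
    n, m, trials = 10000, 200, 20
    rng = random.Random(1)
    samples = [sample_sorted_sublist_lengths(n, m, rng) for _ in range(trials)]
    for i in (0, m // 2, m):
        observed = sum(s[i] for s in samples) / trials
        print(i, round(expected_ith_shortest(n, m, i), 1), round(observed, 1))
```

For $i = m$ the estimate reduces to $\frac{n}{m}\ln(2(m + 1))$, the longest-sublist formula given above.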


4.2  Cost  of  the  algorithm 

Using the timing equations of each piece of the algorithm we can determine the cost of the complete algorithm, assuming we know the exact lengths of the sublists and when packs are performed. Assume we traverse $s_i$, $i = 1, \ldots, l$, links of each list between packs. Let $S_i$ be the total number of links traversed in each list before the $i^{th}$ pack. That is,

$$S_0 = 0$$
$$S_i = \sum_{j=1}^{i} s_j, \quad i = 1, \ldots, l$$
$$s_i = S_i - S_{i-1}, \quad i = 1, \ldots, l.$$

Let $g(x)$ be the expected number of sublists that have length greater than $x$. From the previous section

$$g(x) = m \times \mathrm{Prob}\{\text{sublist length} > x\} \qquad (1)$$
$$\approx m\,e^{-mx/n} \qquad (2)$$

The dotted line in Figure 12 shows $g(x)$ when $n = 10000$ and $m = 200$. The x-axis is the sublist length and the y-axis is the number of sublists longer than that length. You can think of each sublist as being laid out from left to right, and placed one above the other from longest to shortest, each starting at $x = 0$. That is, the y-axis is the number of sublists that are still active in the computation, namely the vector lengths of the computations, while the x-axis is the number of links traversed in each list. As we proceed from left to right, we are performing list ranking on a vector of length equal to the height of the step function. Every time we perform a pack operation (at the corner of a step) the vector length decreases. The area under the step function in Figure 12 is the expected total number of links traversed in either Phase 1 or Phase 3. If we packed every step then this number would be $n$, the area under the curve $g(x)$. Our aim is to minimize the area under the step function that is above the dotted line, while keeping the cost of packing down. The cost of packing is proportional to the sum of the heights of the step function.
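
The trade-off described above can be made concrete with a small sketch. It assumes the exponential form $g(x) = m e^{-mx/n}$ from the previous section; the function names are hypothetical. It computes the area under the step function (links traversed) and the sum of the step heights (which drives the packing cost) for a given set of pack points.

```python
import math

def g(x, n, m):
    """Expected number of sublists longer than x (exponential approximation)."""
    return m * math.exp(-m * x / n)

def links_traversed(pack_points, n, m):
    """Area under the step function: between consecutive packs the vector
    length stays at g(S_i), so (S_{i+1} - S_i) * g(S_i) links are traversed."""
    total, prev = 0.0, 0.0
    for s in pack_points:
        total += (s - prev) * g(prev, n, m)
        prev = s
    return total

def sum_of_heights(pack_points, n, m):
    """Sum of the step heights g(S_0), ..., g(S_{l-1}); the packing cost is
    proportional to this sum."""
    starts = [0.0] + list(pack_points)[:-1]
    return sum(g(s, n, m) for s in starts)

if __name__ == "__main__":
    n, m = 10000, 200
    dense = list(range(1, 2001))        # pack after every iteration
    sparse = list(range(50, 501, 50))   # pack every 50 iterations
    for name, pts in (("dense", dense), ("sparse", sparse)):
        print(name, round(links_traversed(pts, n, m)), round(sum_of_heights(pts, n, m)))
```

Packing after every iteration keeps the area close to $n$ but makes the sum of heights large; packing rarely does the opposite.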


Figure 12: The dotted function is $g(x)$, the expected number of sublists that have length greater than $x$, where $n = 10000$ and $m = 200$. When the number of packing steps is 11, the expected execution time on the Cray C-90 is minimized by packing at the vertical lines. The step function shows the expected number of sublists that are currently active at every iteration of list ranking. The size of a step is the expected number of sublists to complete since the previous pack.

Because the $s_i$'s are the same for both Phase 1 and 3, we can combine the costs of Initial_Rank and Final_Rank to get a single cost equation for ranking. Similarly we can combine the costs of Initial_Pack and Final_Pack to get a single cost equation for packing. Thus, the costs of a single rank step and a single pack are:

$$T_{\text{Rank step}}(x) = 8.4x + 180$$
$$T_{\text{Pack step}}(x) = 15x + 940,$$


where $x$ is the vector length over which the rank and the pack are taking place. The expected total time of all the packs is:

$$T_{\text{Pack}} = \sum_{i=0}^{l-1} \left[15\,g(S_i) + 940\right] = 15 \sum_{i=0}^{l-1} g(S_i) + 940\,l,$$

where $g(S_i)$ is the expected number of sublists remaining after $S_i$ steps of list ranking. Because we do $s_{i+1}$ steps of list ranking between packs, the expected total time of the list ranking is:

$$T_{\text{Rank}} = \sum_{i=0}^{l-1} s_{i+1}\,T_{\text{Rank step}}(g(S_i)) = 8.4 \sum_{i=0}^{l-1} s_{i+1}\,g(S_i) + 180\,S_l \ge 8.4n + 180\,S_l,$$

where $S_l$ is the first $S_i$ greater than the length of the longest sublist. If we pack every time a sublist completes then $\sum_{i=0}^{l-1} s_{i+1}\,g(S_i) = n$. However, in general we delay packs and the sum is greater than $n$ (see Figure 12).

Similarly, we can combine the times of Initialize, Find_Sublist_List, and Restore_List since they depend on the number of sublists only. These combined times are:

$$T_{\text{other}}(m) = 4m + 9720.$$

Thus, the expected total time for Phase 1 and 3 of the algorithm is

$$T_{P1+P3} = T_{\text{Rank}} + T_{\text{Pack}} + T_{\text{other}} \qquad (3)$$
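
The cost model above can be evaluated directly for any pack schedule. The sketch below is illustrative only: the function name is hypothetical, and the coefficients are passed as parameters. The defaults 8.4, 180, 940, and 9720 follow the timing equations in the text, while the per-element pack coefficient `c` and the per-sublist coefficient `e` are assumptions of this sketch and should be replaced by measured values.

```python
import math

def g(x, n, m):
    """Expected number of sublists longer than x."""
    return m * math.exp(-m * x / n)

def phase13_time(pack_points, n, m,
                 a=8.4, b=180.0,     # rank step: a*x + b
                 c=15.0, d=940.0,    # pack step: c*x + d (c is an assumed value)
                 e=4.0, f=9720.0):   # other:     e*m + f (e is an assumed value)
    """T_Rank + T_Pack + T_other for packs at S_1 < ... < S_l, with S_0 = 0."""
    pts = list(pack_points)
    starts = [0.0] + pts[:-1]                      # S_0 .. S_{l-1}
    t_rank = sum((s_next - s) * (a * g(s, n, m) + b)
                 for s, s_next in zip(starts, pts))
    t_pack = sum(c * g(s, n, m) + d for s in starts)
    t_other = e * m + f
    return t_rank + t_pack + t_other

if __name__ == "__main__":
    n, m = 10000, 200
    print(phase13_time([300.0], n, m))                  # a single pack at the end
    print(phase13_time(list(range(30, 301, 30)), n, m)) # a pack every 30 iterations
```

With a single pack the vector length never shrinks, so the rank term dominates; adding packs trades rank time for pack time.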

4.3 Minimizing the time given fixed parameters

Suppose we are given $n$, the length of the original linked list, $m$, the number of sublists, and $l$, the number of times to pack. How do we decide when to pack? The simplest way is to divide the expected length of the longest sublist by $l$ and pack after every fixed number of iterations. However, from the previous discussion we know that the lists do not drop out at a constant rate. That is, $g(x)$, the expected number of sublists that have length greater than $x$, is not linear; it is exponential. In this section we show at which iterations of list ranking we should pack so as to minimize the expected execution time of the algorithm, given $n$, $m$, and $l$. In the next section, we describe how we determine $m$ and $l$ given $n$.

Consider $n$, $m$, and $l$ fixed. We want to minimize the execution time with respect to $S_1, S_2, S_3, \ldots, S_l$, where $S_i$ is the number of iterations of list ranking that have occurred at the $i^{th}$ pack. That is, we want to minimize the following function:

$$T_{P1+P3}(S_1, \ldots, S_l) = \sum_{i=0}^{l-1} \left[(S_{i+1} - S_i)(a\,g(S_i) + b) + c\,g(S_i) + d\right] + e\,m + f, \qquad (4)$$

where $T_{\text{Rank step}}(x) = ax + b$, $T_{\text{Pack step}}(x) = cx + d$, and $T_{\text{other}}(x) = ex + f$. We can minimize Equation 4 by taking partial derivatives for each $S_i$ and setting to zero to obtain a set of $l$ simultaneous equations. That is,

$$\frac{\partial T}{\partial S_i} = a(S_{i+1} - S_i)\,g'(S_i) - (a\,g(S_i) + b) + (a\,g(S_{i-1}) + b) + c\,g'(S_i) = 0$$

or

$$-g'(S_i) = \frac{g(S_{i-1}) - g(S_i)}{S_{i+1} - S_i + \frac{c}{a}}. \qquad (5)$$

Each equation specifies that for a set of three consecutive $S$'s, $S_{i-1}$, $S_i$, and $S_{i+1}$, the $S_i$ is located where the slope of $g$ at $S_i$ is equal to the slope of the line through the points $(S_i, g(S_{i-1}))$ and $(S_{i+1} + \frac{c}{a}, g(S_i))$; see Figure 13.


Figure 13: Time is minimized when, for each set of three consecutive $S$'s, $S_{i-1}$, $S_i$, and $S_{i+1}$, the slope $g'(S_i)$ is equal to the slope of the line through the points $(S_i, g(S_{i-1}))$ and $(S_{i+1} + \frac{c}{a}, g(S_i))$. $S'$ is the point where time would be minimized if there was no cost for packs.

Because $g$ is the exponential function it is not obvious how to solve for $S_i$ given $S_{i-1}$ and $S_{i+1}$. However, it is not difficult to solve for $S_{i+1}$ given $S_i$ and $S_{i-1}$ (or to solve for $S_{i-1}$ given $S_i$ and $S_{i+1}$). Namely,

$$S_{i+1} = S_i - \frac{g(S_{i-1}) - g(S_i)}{g'(S_i)} - \frac{c}{a} = S_i + \frac{n}{m}\,\frac{g(S_{i-1}) - g(S_i)}{g(S_i)} - \frac{c}{a}, \qquad (6)$$

since $g(x) = m\,e^{-mx/n}$. That is, if we know the value of two consecutive packing points we can determine the following (or previous) packing point. Since we know $S_0 = 0$, if we know $S_1$, we can compute $S_2, \ldots, S_l$ iteratively.
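
The iteration just described can be sketched as follows. This is a hedged illustration rather than the implementation: the function name is hypothetical, the stopping rule uses the longest-sublist estimate $\frac{n}{m}\ln(2(m + 1))$ from Section 4.1, and the ratio `c_over_a` is a parameter whose value in the usage line is an assumption. It relies on $g(S_{i-1})/g(S_i) = e^{m(S_i - S_{i-1})/n}$, which follows from the exponential form of $g$.

```python
import math

def pack_schedule(n, m, s1, c_over_a, max_packs=1000):
    """Iterate S_{i+1} = S_i + (n/m) * (g(S_{i-1})/g(S_i) - 1) - c/a from
    S_0 = 0 and S_1 = s1, stopping at the first S_i that exceeds the
    estimated length of the longest sublist. Returns [S_1, ..., S_l]."""
    longest = (n / m) * math.log(2 * (m + 1))
    points = [0.0, s1]
    while points[-1] <= longest and len(points) <= max_packs:
        prev, cur = points[-2], points[-1]
        ratio = math.exp(m * (cur - prev) / n)      # g(S_{i-1}) / g(S_i)
        nxt = cur + (n / m) * (ratio - 1.0) - c_over_a
        if nxt <= cur:
            # The recurrence has collapsed: S_1 was too small for this c/a.
            raise ValueError("pack points stopped increasing; choose a larger s1")
        points.append(nxt)
    return points[1:]

if __name__ == "__main__":
    schedule = pack_schedule(n=10000, m=200, s1=14.7, c_over_a=15 / 8.4)
    print(len(schedule), [round(s, 1) for s in schedule])
```

With $n = 10000$, $m = 200$, $S_1 = 14.7$, and the assumed ratio $c/a = 15/8.4$, the sketch produces 11 pack points whose spacing grows from one pack to the next.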

The vertical lines in Figure 12 were found using $S_1 = 14.7$ and the equations in Section 3. Notice that the $S_i$'s become increasingly further apart for larger $i$'s, reflecting the fact that the rate at which sublists complete slows down over time. The factor $c/a$ in Equation 6 reflects the relative cost of packing and ranking. To see the effect of this factor, consider how the spacing of packs over all the iterations changes if we keep the total number of packs the same but increase the value of $c$. When $c$ is increased, packing would occur less frequently during the initial iterations of ranking and occur more frequently during later iterations. That is, the vertical lines in Figure 12 would be further apart for small iteration numbers and closer together for large iteration numbers if we increase $c$. This reflects the fact that initially packing is expensive because the vector lengths are long and later packing becomes less expensive because the vector lengths are short. If we make $c$ large enough eventually we find that the execution time is reduced by decreasing the number of packs even though the number of ranking steps increases. In the next section we consider how to determine the best number of packs to perform.

We can simplify Equation 4 by using Equation 6 to substitute for $S_{i+1} - S_i$. That is,

$$\begin{aligned}
\sum_{i=0}^{l-1} (S_{i+1} - S_i)(a\,g(S_i) + b) &= a S_1 g(S_0) + a \sum_{i=1}^{l-1} (S_{i+1} - S_i)\,g(S_i) + b \sum_{i=0}^{l-1} (S_{i+1} - S_i) \\
&= a m S_1 + a \sum_{i=1}^{l-1} \left[\frac{n}{m}\left(g(S_{i-1}) - g(S_i)\right) - \frac{c}{a}\,g(S_i)\right] + b S_l \\
&\approx a m S_1 + a n - c \sum_{i=1}^{l-1} g(S_i) + b S_l,
\end{aligned}$$

where the telescoping sum gives $\frac{n}{m}(m - g(S_{l-1})) \approx n$ and where $S_l \approx \frac{n}{m} \ln m$. Thus,

$$T_{P1+P3}(S_1, \ldots, S_l) \approx a n + b\,\frac{n}{m} \ln m + (a S_1 + c + e)\,m + d\,l + f.$$

For the Cray C-90

$$T_{P1+P3} \approx 8.4n + 180\,\frac{n}{m} \ln m + (8.4 S_1 + 19)\,m + 940\,l + 9720, \qquad (7)$$

where $m$ is the number of sublists, $S_1$ is the number of links traversed before the first pack, and $l$ is the number of packs and is a function of $S_1$, $m$, and $n$.

4.4  Overall  vector  performance 

From the previous discussion we have a way to determine at which iterations we should pack, if we know the length of the whole linked list, the number of sublists, and the iteration number of the first pack. However, all we know is the length of the whole linked list, namely $n$. We now need to find good choices for the number of sublists $m$ and the iteration number of the first pack $S_1$, which determines the number of packs $l$. Our approach is to estimate the running time of the algorithm, using Equation 7, for various values of $m$, $S_1$ and $n$. Then, for each value of $n$ we find values of $m$ and $S_1$ that minimize the running time to within about two percent. Finally, we fit functions to $m$ vs. $n$ and $S_1$ vs. $n$. It appears that $m$ and $S_1$ are approximately cubic polynomials of $\log n$. It is these fitted polylog functions that we use in our implementation to determine $m$ and $S_1$ given $n$, and Equation 6 to find successive values of $S_i$.

However, we found that $S_1$ was a very sensitive parameter. From the previous section our intuition is that packing should occur less frequently as we proceed through the processing. However, if $S_1$ is too small, the packing steps become rapidly closer and closer until we are packing at every iteration of list ranking. The result is far too much packing and performance degrades rapidly. To protect ourselves from this sensitivity we modified Equation 6 so that successive $S$'s are always increasing. With this modification we found that the fitted cubic log functions performed very well in practice.
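
The tuning procedure of this section can be sketched as a grid search over $m$ and $S_1$ using the closed-form estimate. Everything here is an assumption-laden illustration: the function names are hypothetical, the coefficients `c` and `e` are assumed values, and the safeguard shown (never letting the interval between packs shrink) is only one possible way to keep the pack points increasing; the text does not say exactly how Equation 6 was modified.

```python
import math

def num_packs(n, m, s1, c_over_a, limit=100000):
    """Number of packs l generated by the Equation 6 recurrence, with the
    interval between packs forced to be non-decreasing (assumed safeguard)."""
    longest = (n / m) * math.log(2 * (m + 1))
    prev, cur, packs = 0.0, s1, 1
    while cur <= longest and packs < limit:
        step = (n / m) * (math.exp(m * (cur - prev) / n) - 1.0) - c_over_a
        step = max(step, cur - prev)       # safeguard: never shrink the interval
        prev, cur = cur, cur + step
        packs += 1
    return packs

def predicted_time(n, m, s1, a=8.4, b=180.0, c=15.0, d=940.0, e=4.0, f=9720.0):
    """Closed-form estimate a*n + b*(n/m)*ln(m) + (a*S_1 + c + e)*m + d*l + f.
    The values of c and e are assumptions of this sketch."""
    l = num_packs(n, m, s1, c / a)
    return a * n + b * (n / m) * math.log(m) + (a * s1 + c + e) * m + d * l + f

def tune(n, m_values, s1_values):
    """Return (time, m, s1) minimizing the predicted time over a grid."""
    return min((predicted_time(n, m, s1), m, s1)
               for m in m_values for s1 in s1_values)

if __name__ == "__main__":
    s1_grid = [5 + 0.5 * k for k in range(60)]
    for n in (10000, 100000, 1000000):
        t, m, s1 = tune(n, range(50, 5001, 50), s1_grid)
        print(n, m, s1, round(t / n, 2))   # predicted clocks per element
```

Repeating the search for many values of $n$ and fitting curves to the resulting $m$ and $S_1$ would mirror the procedure described above.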

Figure 14 compares the predicted time with the observed running time. The predicted time was computed by estimating the parameter values for each value of $n$ using the fitted cubic equations and then applying Equation 3 for those parameter values. As the figure indicates, the equation is an accurate predictor of the running time. Notice that the running time decreases until it reaches an asymptote of about 8.6 clocks per element.


Figure 14: The predicted performance and measured performance of the vectorized List_Scan on one processor of the Cray C-90. The values for the parameters $m$ and $S_1$ were determined by minimizing the predicted performance.

5  Vector  Multiprocessor  List  Scan 

In Section 3 we showed how we implemented the vectorized version of the data parallel list scan algorithm. In this section we show that extending the algorithm to multiple vector processors is relatively straightforward. We then discuss issues relating to the performance of the vector multiprocessor version and its speedup with respect to the single vector processor version. Finally, we relate our algorithm to other parallel PRAM algorithms and explain why we chose not to implement them.

The overall approach is to divide the virtual processors equally among the physical vector processors and let vectorization proceed on the virtual processor data assigned to the physical processors. The Cray C compiler makes parallelizing relatively easy. Loops are modified to be tasked loops using compiler directives so that different iterations of the loops are divided among the processors. Because our arrays are often longer than the vector length and we know that the loops can be vectorized, we chose to direct the compiler to divide the loops into equal size chunks, one chunk per processor, and to vectorize the chunk within each processor.

For MIMD processors we also tried to minimize the number of synchronization points. Data parallel algorithms assume that each data parallel step is synchronized, whether or not it is necessary. In the code we presented in Section 3 we need to synchronize at most after every innermost loop. In particular, we must synchronize after the loops in Initialize and Find_Sublist_List and after Phase 1 and Phase 2 for correctness. If we use the parallel algorithm as described in Section 2 and have an equal number of active virtual processors assigned to physical processors at all times, we also need to synchronize before each pack in Phase 1 and Phase 3 so that load balancing can proceed globally across the physical processors.

However, we deviated somewhat from this strict form of assignment of virtual processors. Instead we assign them to physical processors once at the beginning and pack locally within each physical processor only. In this way each processor completes all of Phase 1 and Phase 3 independently of the other processors. The effect is that we need to do no synchronization within Phase 1 or Phase 3 and there is no load balancing across processors. Eliminating synchronization avoids needless delays at each synchronization point. Avoiding global load balancing across processors is important because most compilers do not know how to do a pack operation across processors in parallel. Of course, with some effort we could apply loop raking to get a vector multiprocessor algorithm for pack [5].
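
To get a feel for how uneven the work becomes when sublists are assigned to processors once and never rebalanced globally, one can simulate the assignment. The sketch below is an illustration under stated assumptions, not a model of the Cray implementation: the function name is hypothetical, cut positions are drawn as distinct random integers, and each processor receives an equal-size block of sublists.

```python
import random

def per_processor_load(n, m, p, rng):
    """Cut a list of n elements at m distinct random positions, give each of
    p processors an equal-size block of the m + 1 sublists, and return for
    each processor (total links, longest sublist). The total reflects the
    work a processor performs; the longest sublist bounds its iteration count.
    Sublists are taken in list order, assumed here to be representative of
    any fixed assignment because the cut positions are random."""
    cuts = sorted(rng.sample(range(1, n), m))
    bounds = [0] + cuts + [n]
    lengths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    loads = []
    for q in range(p):
        lo = q * len(lengths) // p
        hi = (q + 1) * len(lengths) // p
        chunk = lengths[lo:hi]
        loads.append((sum(chunk), max(chunk)))
    return loads

if __name__ == "__main__":
    rng = random.Random(7)
    for links, longest in per_processor_load(n=100000, m=800, p=8, rng=rng):
        print(links, longest)
```

The spread between the smallest and largest per-processor totals gives a rough indication of the idle time that local-only packing can incur.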

Because we use randomization, we do not expect a significant load imbalance when we only load balance locally. Even if an imbalance should become a problem as the procedure progresses, only one across-processor load balancing should be necessary. Our results are quite good without any global load balancing. If we ignore load imbalance and synchronization costs we can get an estimate of the execution time of Phase 1 and Phase 2 by dividing the vector lengths equally among the processors. Namely,

where $p$ is the number of processors and $m < n/\log n$.

Unfortunately, to tune the parameters $m$ and $S_1$ we need to minimize the predicted time for every possible number of processors. For highly or massively parallel machines, tuning the parameters for every number of processors would not be practical. We tuned the parameters for 1, 2, 4, and 8 processors, which resulted in the execution times shown in Figure 15. For 8 processors we achieve a speedup of 6.7.

5.1 Other work efficient list ranking algorithms

On  one  Cray  C-90  vector  processor  our  algorithm  takes  about  10  clock  cycles  per  list  element  asymptotically 
to  find  the  rank  or  scan  of  a  linked  list.  If  other  algorithms  are  to  be  competitive,  they  must  be  able  to  use 
no  more  than  10  cycles  per  element  on  average.  Below  we  discuss  various  other  algorithms  that  have  been 
described  in  the  literature.  Except  for  Wyllie's  pointer  jumping  algorithm  on  short  linked  lists,  we  conclude 
that  other  algorithms  are  unlikely  to  be  competitive. 
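
For reference, Wyllie's pointer jumping can be sketched in a few lines. This is a sequential simulation of the synchronous parallel steps, written for illustration only; the function name and the convention that the tail points to itself are assumptions of the sketch. Every round touches all $n$ nodes and about $\log_2 n$ rounds are needed, which is why the algorithm performs $O(n \log n)$ work and is not work efficient.

```python
def wyllie_rank(succ):
    """succ[i] is the index of the node that follows node i; the tail points
    to itself. Returns (rank, rounds), where rank[i] is the number of links
    from node i to the tail."""
    n = len(succ)
    nxt = list(succ)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = 0
    # Continue until every node points directly at the tail.
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        # One synchronous step: all reads use the old arrays.
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
        rounds += 1
    return rank, rounds

if __name__ == "__main__":
    # The list 2 -> 1 -> 0 -> 3 -> 4, stored in scrambled order.
    print(wyllie_rank([3, 0, 1, 4, 4]))
```

On a short list the few rounds cost little, which is consistent with the remark above that pointer jumping remains competitive only for short linked lists.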

Cole and Vishkin devised a parallel deterministic coin tossing technique [7] which they used to develop an optimal deterministic parallel list ranking algorithm [8, 4]. This algorithm breaks the linked list into sublists two or three nodes long (the heads of the sublists form a 2-ruling set); reduces each sublist to a single node; and then compacts these single nodes into contiguous memory to create a new linked list. It recursively applies the algorithm to the new linked list until the resulting linked list is shorter than $n/\log n$, at which point it applies Wyllie's algorithm. In the final phase it reconstructs the linked list by unraveling the recursion in the first phase to fill in the rank values of the removed nodes. The algorithm runs in $O(\log n \log\log n)$ parallel time and uses $O(n)$ steps. Later they modified their algorithm to give the first $O(\log n)$ time optimal deterministic algorithm [9, 22]. However, algorithms for finding 2-ruling sets that give either of these time bounds are quite complicated and have very large constants. They also give a much simpler 2-ruling set algorithm that is not work efficient but has smaller constants (see [4]). Because it is not work efficient and its constants are larger than Wyllie's or ours, we chose not to implement it.

Anderson and Miller [2] combined their randomized algorithm with the Cole/Vishkin deterministic coin tossing to get an optimal $O(\log n)$ time deterministic list ranking algorithm. As with their randomized algorithm, each processor is assigned $\log n$ nodes to process. At each round each processor executes a case statement that either breaks contention, splices out a node in its own queue, or splices out a node in another processor's queue. To break contention it finds a $\log\log n$ ruling set. Finding $\log\log n$ ruling sets is much simpler ($O(1)$ time) than finding 2-ruling sets ($O(\log n)$ time with large constants). But because each round involves a nonparallel three way case statement, where each case needs to be completed by all the processors before the next case can be executed, its constants are also much larger than ours.

The basic structure of Cole and Vishkin's algorithms is similar to the structure of our algorithm. The main difference is that we break the linked list into relatively few sublists that can be quite long, whereas Cole and Vishkin divide the linked list into more than $n/3$ sublists that are only two or three nodes long. To get such fine grain list lengths is quite expensive and needs to be repeated $O(\log n)$ times. We find sublists only a few times and, because our sublists are relatively long, we can process the lists at full speed for their entire length. The primary reason our algorithm is so successful is that it has very small constants and is work efficient. And as long as the number of processors is small relative to the size of the list the parallel running time is optimal. The success of the implementation is due to pipelining reads and writes through vectorization to hide latency, minimizing load balancing by deriving equations for predicting and optimizing performance, and avoiding conditional tests except at load balancing points.

6 Conclusions and Future Directions


In this paper we described a new parallel algorithm and its implementation for list ranking and list scan. List ranking and list scan are primitive operations on lists and the building blocks of many parallel algorithms using lists, graphs, and trees. Because of their expected poor performance on today's supercomputers, there are virtually no implementations of algorithms using these data structures. Our contribution is that we have implemented the most basic of these “untouchable” algorithms on the Cray C-90 with success.

One of the primary problems with any list ranking algorithm is that the access pattern to the list is very irregular and unpredictable. But fortunately, the Cray class of computers has a very fast memory access network that makes implementing a list ranking algorithm reasonable. It is Cray's pipelined memory access and extremely high global bandwidth that make our implementation so fast.

Although the parallel running time of our algorithm is $O(n \ln m / m)$, where $m$ is small relative to $n$, it is work efficient. And because it is work efficient and has small constants it is the fastest implementation of list ranking and list scan to date. Most parallel list ranking algorithms attempt to find a large number, at least $O(n/\log n)$ and as many as $n/2$, of nonadjacent elements in the list and assign them equally among the processors. Our algorithm only tries to find a relatively small number, $m$, of such elements. However, the amount of work assigned to each processor can be quite different. But by a unique analysis of the expected work loads we are able to determine at what iterations to perform load balancing to minimize the overall running time of the algorithm.

As with any implementation there are a multitude of possible modifications and enhancements that could improve its performance. A large part of the performance loss is due to short vector lengths. As lists drop out of the computation the vector lengths shorten. Not only are the vector lengths short, the number of iterations remaining with short vector lengths can be relatively large, since the longest sublists can be much longer than the other sublists. Short vectors are inefficient because with each iteration there is a latency due to filling the vector pipes. On the Cray computers this inefficiency is fairly small, because these machines have particularly small vector half performance lengths. But many vector machines have quite long vector half performance lengths. For these machines it may be better to reconnect the sublists into a single reduced sublist before all the processors have reached the tails. The elements still remaining in the lists could then be packed into contiguous memory and then Phase 1 recursively applied. Keeping track of which elements have been processed and which have not requires extra bookkeeping that would slow down the main ranking portion of the algorithm. But the trade off may be worth it if the vector machine has long vector half performance lengths.
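
As a rough illustration of the half performance length, consider any loop whose time on a vector of length $x$ is linear, $T(x) = ax + b$. The derivation below is a sketch under that assumption; applying it to the combined rank step cost $8.4x + 180$ from Section 4.2 treats the whole rank loop as a single vector operation, which is a simplification.

```latex
% Processing rate at vector length x, and its asymptote:
r(x) = \frac{x}{ax + b}, \qquad r_\infty = \lim_{x \to \infty} r(x) = \frac{1}{a}.
% The half performance length n_{1/2} is the length at which r = r_\infty / 2:
\frac{n_{1/2}}{a\,n_{1/2} + b} = \frac{1}{2a}
\;\Longrightarrow\; 2a\,n_{1/2} = a\,n_{1/2} + b
\;\Longrightarrow\; n_{1/2} = \frac{b}{a}.
% For the rank step, a = 8.4 and b = 180:
n_{1/2} = \frac{180}{8.4} \approx 21.4.
```

Under this reading, a rank step runs at half of its asymptotic rate once only about 21 sublists remain active; a machine with a larger startup cost $b$ relative to $a$ would reach that point much earlier in the computation.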

Finally, the question still remains whether having a fast list ranking implementation is useful as a primitive for other major applications. If so, we may have opened up major classes of PRAM algorithms that can have reasonable implementations. It also would be interesting to see whether our approach of subdividing a problem randomly into a moderate number of fairly coarse grain subproblems and applying load balancing periodically can be applied to other computational problems.